📊 AI Benchmarks
Specific
benchmark, leaderboard, evaluation, MMLU, evals
You don't need all the LLM benchmarks
🏆 LLM Benchmarking · alex.smola.org · 6d · Hacker News
Independent benchmarks for self-hosted AI
📊 AI Performance Profiling · fitmyllm.com · 2h
Can LLMs Generate Enterprise-Quality Code?
🏆 LLM Benchmarking · startuphub.ai · 11h
The future will be millions agents running task everyday?
📋 AGENTS.md · github.com · 1d · Hacker News
Resolution Diagnostics for Paired LLM Evaluation
📊 Model Evals · arxiv.org · 3d
Alibaba’s new AI model scores higher than OpenAI, Google rivals in coding ranking
🇨🇳 Chinese AI · scmp.com · 4d · r/SCMPauto
intent-bench/intent-bench: Intent fulfillment benchmark for agentic AI engineering
💻 Coding Agents · github.com · 3d · Hacker News
Low Rank for Rank: Uncertainty-Aware Task-Specific LLM Ranking under Sparse Pairwise Comparisons
🏆 LLM Benchmarking · arxiv.org · 3d
Latent Performance Profiling of Large Language Models
⚡ LLM Optimization · arxiv.org · 3d
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
🌍 World Models · arxiv.org · 6d
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
👁️ Perceptual Coding · arxiv.org · 6d
FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
🏆 LLM Benchmarking · arxiv.org · 3d
Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit
⚡ LLM Optimization · arxiv.org · 3d
MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference
⚡ LLM Optimization · arxiv.org · 3d
Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation
🏆 LLM Benchmarking · arxiv.org · 6d
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
🧠 Context Engineering · arxiv.org · 4d
Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
🤝 AI-Assisted Coding · arxiv.org · 3d
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
✅ Document Verification · arxiv.org · 3d
Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation
⚛️ Quantum Computing · arxiv.org · 5d
TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems
🤝 Agent Systems · arxiv.org · 4d